Protein-Protein Interaction Extraction: A Supervised Learning Approach
Abstract
In this paper, we propose using Maximum Entropy to extract protein-protein interaction (PPI) information from the literature, overcoming the limitations of state-of-the-art co-occurrence-based and rule-based approaches. The model incorporates corpus statistics of various lexical, syntactic and semantic features. We find that shallow lexical features account for a large share of the performance improvement, in contrast to parsing or partial-parsing information. Yet such lexical features have not previously been used in other PPI extraction systems. As a result, the new approach achieves an encouraging 93.9% recall and 88.0% precision on the IEPA corpus. To the best of our knowledge, this is not only the first systematic study of supervised learning and the first attempt at feature-based supervised learning for PPI extraction; it also provides useful features, such as surrounding words, key words and abbreviations, that extend supervised relation extraction to other domains such as news.
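The abstract does not spell out the feature templates or the training procedure, so the following is only a minimal sketch of the general idea: shallow lexical features (surrounding words and interaction key words) fed to a two-class maximum entropy model, which is equivalent to binary logistic regression. The keyword list, the function names, the window size, the hyperparameters and the `PROT1`/`PROT2` placeholder sentences are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import defaultdict

# Hypothetical interaction key words; the paper's actual list is not given here.
KEYWORDS = {"binds", "interacts", "activates", "inhibits", "phosphorylates"}

def extract_features(tokens, i, j, window=2):
    """Shallow lexical features for the protein pair at token positions i < j."""
    feats = {"bias": 1.0}
    # Words immediately to the left of the first protein mention.
    for k in range(max(0, i - window), i):
        feats["left=" + tokens[k].lower()] = 1.0
    # Words between the two mentions, plus a flag for any interaction key word.
    for k in range(i + 1, j):
        word = tokens[k].lower()
        feats["between=" + word] = 1.0
        if word in KEYWORDS:
            feats["keyword_between"] = 1.0
    # Words immediately to the right of the second protein mention.
    for k in range(j + 1, min(len(tokens), j + 1 + window)):
        feats["right=" + tokens[k].lower()] = 1.0
    return feats

def predict(weights, feats):
    """P(interaction | features) under a two-class maximum entropy model."""
    score = sum(weights[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-score))

def train(examples, epochs=200, lr=0.5, l2=0.01):
    """Fit weights by stochastic gradient ascent on the L2-regularized log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            error = label - predict(weights, feats)
            for name, value in feats.items():
                weights[name] += lr * (error * value - l2 * weights[name])
    return weights

# Toy, hand-written sentences (not from the IEPA corpus): tokens, i, j, label.
toy = [
    ("PROT1 binds PROT2 in vitro".split(), 0, 2, 1),
    ("PROT1 interacts with PROT2".split(), 0, 3, 1),
    ("PROT1 and PROT2 were measured".split(), 0, 2, 0),
    ("PROT1 or PROT2 levels rose".split(), 0, 2, 0),
]
examples = [(extract_features(t, i, j), y) for t, i, j, y in toy]
model = train(examples)
```

Calling `predict(model, extract_features(tokens, i, j))` on a new tokenized sentence returns the estimated probability that the two marked proteins interact. A real system would use a much larger annotated corpus and an established maximum entropy toolkit rather than this hand-rolled trainer.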
Similar resources
Improving Distantly Supervised Extraction of Drug-Drug and Protein-Protein Interactions
Relation extraction is frequently and successfully addressed by machine learning methods. The downside of this approach is the need for annotated training data, typically generated in tedious manual, cost intensive work. Distantly supervised approaches make use of weakly annotated data, like automatically annotated corpora. Recent work in the biomedical domain has applied distant supervision fo...
Semi-Supervised Classification for Extracting Protein Interaction Sentences using Dependency Parsing
We introduce a relation extraction method to identify the sentences in biomedical text that indicate an interaction among the protein names mentioned. Our approach is based on the analysis of the paths between two protein names in the dependency parse trees of the sentences. Given two dependency trees, we define two separate similarity functions (kernels) based on cosine similarity and edit dis...
Weakly Labeled Corpora as Silver Standard for Drug-Drug and Protein-Protein Interaction
Institute for Computer Science Humboldt-Universität zu Berlin Unter den Linden 6 10099 Berlin Germany Fraunhofer Institute for Algorithms and Scientific Computing (SCAI) Schloss Birlinghoven 53754 Sankt Augustin Germany Bonn-Aachen Center for Information Technology (B-IT) Dahlmannstraße 2 53113 Bonn Germany {tbobic,klinger,hofmann-apitius}@scai.fraunhofer.de {thomas,leser}@informatik.hu-berlin....
Semi-supervised learning of the hidden vector state model for extracting protein-protein interactions
OBJECTIVE: The hidden vector state (HVS) model is an extension of the basic discrete Markov model in which context is encoded as a stack-oriented state vector. It has been applied successfully to protein-protein interaction extraction. However, the HVS model, being a statistically based approach, requires large-scale annotated corpora in order to reliably estimate model parameters. This is nor...
A Hierarchical n-Grams Extraction Approach for Classification Problem
We are interested in classifying proteins based on their primary structures. The goal is to automatically classify protein sequences according to their families. This task requires extracting a set of descriptors that are then presented to supervised learning algorithms. Many types of descriptors are used in the literature. The most popular one is the n-gram. It corresponds to a...
Publication date: 2005